{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hyperparameter tuning\n",
    "\n",
    "In the previous section, we did not discuss the hyperparameters of random\n",
    "forest and histogram gradient-boosting. This notebook gives crucial\n",
    "information regarding how to set them.\n",
    "\n",
    "<div class=\"admonition caution alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
    "<p class=\"last\">For the sake of clarity, no nested cross-validation is used to estimate the\n",
    "variability of the testing error. We are only showing the effect of the\n",
    "parameters on the validation set.</p>\n",
    "</div>\n",
    "\n",
    "We start by loading the california housing dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data, target = fetch_california_housing(return_X_y=True, as_frame=True)\n",
    "target *= 100  # rescale the target in k$\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data, target, random_state=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random forest\n",
    "\n",
    "The main parameter to select in random forest is the `n_estimators` parameter.\n",
    "In general, the more trees in the forest, the better the generalization\n",
    "performance would be. However, adding trees slows down the fitting and prediction\n",
    "time. The goal is to balance computing time and generalization performance\n",
    "when setting the number of estimators. Here, we fix `n_estimators=100`, which\n",
    "is already the default value.\n",
    "\n",
    "<div class=\"admonition caution alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
    "<p class=\"last\">Tuning the <tt class=\"docutils literal\">n_estimators</tt> for random forests generally result in a waste of\n",
    "computer power. We just need to ensure that it is large enough so that doubling\n",
    "its value does not lead to a significant improvement of the validation error.</p>\n",
    "</div>\n",
    "\n",
    "Instead, we can tune the hyperparameter `max_features`, which controls the\n",
    "size of the random subset of features to consider when looking for the best\n",
    "split when growing the trees: smaller values for `max_features` lead to\n",
    "more random trees with hopefully more uncorrelated prediction errors. However\n",
    "if `max_features` is too small, predictions can be too random, even after\n",
    "averaging with the trees in the ensemble.\n",
    "\n",
    "If `max_features` is set to `None`, then this is equivalent to setting\n",
    "`max_features=n_features` which means that the only source of randomness in\n",
    "the random forest is the bagging procedure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"In this case, n_features={len(data.columns)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also tune the different parameters that control the depth of each tree\n",
    "in the forest. Two parameters are important for this: `max_depth` and\n",
    "`max_leaf_nodes`. They differ in the way they control the tree structure.\n",
    "Indeed, `max_depth` enforces growing symmetric trees, while `max_leaf_nodes`\n",
    "does not impose such constraint. If `max_leaf_nodes=None` then the number of\n",
    "leaf nodes is unlimited.\n",
    "\n",
    "The hyperparameter `min_samples_leaf` controls the minimum number of samples\n",
    "required to be at a leaf node. This means that a split point (at any depth) is\n",
    "only done if it leaves at least `min_samples_leaf` training samples in each of\n",
    "the left and right branches. A small value for `min_samples_leaf` means that\n",
    "some samples can become isolated when a tree is deep, promoting overfitting. A\n",
    "large value would prevent deep trees, which can lead to underfitting.\n",
    "\n",
    "Be aware that with random forest, trees are expected to be deep since we are\n",
    "seeking to overfit each tree on each bootstrap sample. Overfitting is\n",
    "mitigated when combining the trees altogether, whereas assembling underfitted\n",
    "trees (i.e. shallow trees) might also lead to an underfitted forest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.model_selection import RandomizedSearchCV\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "\n",
    "param_distributions = {\n",
    "    \"max_features\": [1, 2, 3, 5, None],\n",
    "    \"max_leaf_nodes\": [10, 100, 1000, None],\n",
    "    \"min_samples_leaf\": [1, 2, 5, 10, 20, 50, 100],\n",
    "}\n",
    "search_cv = RandomizedSearchCV(\n",
    "    RandomForestRegressor(n_jobs=2),\n",
    "    param_distributions=param_distributions,\n",
    "    scoring=\"neg_mean_absolute_error\",\n",
    "    n_iter=10,\n",
    "    random_state=0,\n",
    "    # n_jobs=2,  # Uncomment this line if you run locally\n",
    ")\n",
    "search_cv.fit(data_train, target_train)\n",
    "\n",
    "columns = [f\"param_{name}\" for name in param_distributions.keys()]\n",
    "columns += [\"mean_test_error\", \"std_test_error\"]\n",
    "cv_results = pd.DataFrame(search_cv.cv_results_)\n",
    "cv_results[\"mean_test_error\"] = -cv_results[\"mean_test_score\"]\n",
    "cv_results[\"std_test_error\"] = cv_results[\"std_test_score\"]\n",
    "cv_results[columns].sort_values(by=\"mean_test_error\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can observe in our search that we are required to have a large number of\n",
    "`max_leaf_nodes` and thus deep trees. This parameter seems particularly\n",
    "impactful with respect to the other tuning parameters, but large values of\n",
    "`min_samples_leaf` seem to reduce the performance of the model.\n",
    "\n",
    "In practice, more iterations of random search would be necessary to precisely\n",
    "assert the role of each parameters. Using `n_iter=10` is good enough to\n",
    "quickly inspect the hyperparameter combinations that yield models that work\n",
    "well enough without spending too much computational resources. Feel free to\n",
    "try more interations on your own.\n",
    "\n",
    "Once the `RandomizedSearchCV` has found the best set of hyperparameters, it\n",
    "uses them to refit the model using the full training set. To estimate the\n",
    "generalization performance of the best model it suffices to call `.score` on\n",
    "the unseen data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "error = -search_cv.score(data_test, target_test)\n",
    "print(\n",
    "    f\"On average, our random forest regressor makes an error of {error:.2f} k$\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Histogram gradient-boosting decision trees\n",
    "\n",
    "For gradient-boosting, hyperparameters are coupled, so we cannot set them\n",
    "one after the other anymore. The important hyperparameters are `max_iter`,\n",
    "`learning_rate`, and `max_depth` or `max_leaf_nodes` (as previously discussed\n",
    "random forest).\n",
    "\n",
    "Let's first discuss `max_iter` which, similarly to the `n_estimators`\n",
    "hyperparameter in random forests, controls the number of trees in the\n",
    "estimator. The difference is that the actual number of trees trained by the\n",
    "model is not entirely set by the user, but depends also on the stopping\n",
    "criteria: the number of trees can be lower than `max_iter` if adding a new\n",
    "tree does not improve the model enough. We will give more details on this in\n",
    "the next exercise.\n",
    "\n",
    "The depth of the trees is controlled by `max_depth` (or `max_leaf_nodes`). We\n",
    "saw in the section on gradient-boosting that boosting algorithms fit the error\n",
    "of the previous tree in the ensemble. Thus, fitting fully grown trees would be\n",
    "detrimental. Indeed, the first tree of the ensemble would perfectly fit\n",
    "(overfit) the data and thus no subsequent tree would be required, since there\n",
    "would be no residuals. Therefore, the tree used in gradient-boosting should\n",
    "have a low depth, typically between 3 to 8 levels, or few leaves ($2^3=8$ to\n",
    "$2^8=256$). Having very weak learners at each step helps reducing overfitting.\n",
    "\n",
    "With this consideration in mind, the deeper the trees, the faster the\n",
    "residuals are corrected and then less learners are required. Therefore,\n",
    "it can be beneficial to increase `max_iter` if `max_depth` is low.\n",
    "\n",
    "Finally, we have overlooked the impact of the `learning_rate` parameter\n",
    "until now. This parameter controls how much each correction contributes to the\n",
    "final prediction. A smaller learning-rate means the corrections of a new\n",
    "tree result in small adjustments to the model prediction. When the\n",
    "learning-rate is small, the model generally needs more trees to achieve good\n",
    "performance. A higher learning-rate makes larger adjustments with each tree,\n",
    "which requires fewer trees and trains faster, at the risk of overfitting. The\n",
    "learning-rate needs to be tuned by hyperparameter tuning to obtain the best\n",
    "value that results in a model with good generalization performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import loguniform\n",
    "from sklearn.ensemble import HistGradientBoostingRegressor\n",
    "\n",
    "param_distributions = {\n",
    "    \"max_iter\": [3, 10, 30, 100, 300, 1000],\n",
    "    \"max_leaf_nodes\": [2, 5, 10, 20, 50, 100],\n",
    "    \"learning_rate\": loguniform(0.01, 1),\n",
    "}\n",
    "search_cv = RandomizedSearchCV(\n",
    "    HistGradientBoostingRegressor(),\n",
    "    param_distributions=param_distributions,\n",
    "    scoring=\"neg_mean_absolute_error\",\n",
    "    n_iter=20,\n",
    "    random_state=0,\n",
    "    # n_jobs=2, # Uncomment this line if you run locally\n",
    ")\n",
    "search_cv.fit(data_train, target_train)\n",
    "\n",
    "columns = [f\"param_{name}\" for name in param_distributions.keys()]\n",
    "columns += [\"mean_test_error\", \"std_test_error\"]\n",
    "cv_results = pd.DataFrame(search_cv.cv_results_)\n",
    "cv_results[\"mean_test_error\"] = -cv_results[\"mean_test_score\"]\n",
    "cv_results[\"std_test_error\"] = cv_results[\"std_test_score\"]\n",
    "cv_results[columns].sort_values(by=\"mean_test_error\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "<div class=\"admonition caution alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
    "<p class=\"last\">Here, we tune <tt class=\"docutils literal\">max_iter</tt> but be aware that it is better to set <tt class=\"docutils literal\">max_iter</tt> to a\n",
    "fixed, large enough value and use parameters linked to <tt class=\"docutils literal\">early_stopping</tt> as we\n",
    "will do in Exercise M6.04.</p>\n",
    "</div>\n",
    "\n",
    "In this search, we observe that for the best ranked models, having a\n",
    "smaller `learning_rate`, requires more trees or a larger number of leaves\n",
    "for each tree. However, it is particularly difficult to draw more detailed\n",
    "conclusions since the best value of each hyperparameter depends on the other\n",
    "hyperparameter values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now estimate the generalization performance of the best model using the\n",
    "test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "error = -search_cv.score(data_test, target_test)\n",
    "print(f\"On average, our HGBT regressor makes an error of {error:.2f} k$\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mean test score in the held-out test set is slightly better than the score\n",
    "of the best model. The reason is that the final model is refitted on the whole\n",
    "training set and therefore, on more data than the cross-validated models of\n",
    "the grid search procedure.\n",
    "\n",
    "We summarize these details in the following table:\n",
    "\n",
    "| **Bagging & Random Forests**                     | **Boosting**                                        |\n",
    "|--------------------------------------------------|-----------------------------------------------------|\n",
    "| fit trees **independently**                      | fit trees **sequentially**                          |\n",
    "| each **deep tree overfits**                      | each **shallow tree underfits**                     |\n",
    "| averaging the tree predictions **reduces overfitting** | sequentially adding trees **reduces underfitting** |\n",
    "| generalization improves with the number of trees | too many trees may cause overfitting                |\n",
    "| does not have a `learning_rate` parameter        | fitting the residuals is controlled by the `learning_rate` |"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}